individual <- read.csv("/Users/danagonzalez/Downloads/chs_individual.csv")
regional <- read.csv("/Users/danagonzalez/Downloads/chs_regional.csv")
combined <- merge(individual, regional, by = "townname", all = FALSE)
nrow(combined)
[1] 1200
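Because `all = FALSE` requests an inner join, rows whose `townname` has no match are dropped silently, and a duplicated town in the regional table would duplicate individuals. The following is a minimal sketch of the checks one might run around the merge. The two toy data frames and their values are hypothetical stand-ins for the CHS files, not the real data.

```r
# Toy stand-ins for the individual and regional tables (hypothetical values).
individual_toy <- data.frame(
  sid      = 1:5,
  townname = c("Alpine", "Alpine", "Lompoc", "Lompoc", "Upland"),
  fev      = c(2100, 1950, 2200, NA, 2050)
)
regional_toy <- data.frame(
  townname  = c("Alpine", "Lompoc", "Upland"),
  pm25_mass = c(8.7, 6.0, 22.5)
)

# The regional table should hold exactly one row per town; otherwise the
# merge would duplicate individuals.
stopifnot(!anyDuplicated(regional_toy$townname))

combined_toy <- merge(individual_toy, regional_toy, by = "townname", all = FALSE)

# Towns present in the individual table but missing from the regional table
# would be dropped by the inner join.
unmatched <- setdiff(individual_toy$townname, regional_toy$townname)

# TRUE when the merge neither dropped nor duplicated any individual.
rows_kept <- nrow(combined_toy) == nrow(individual_toy)
```

Applied to the real tables, the same comparison would be `nrow(combined) == nrow(individual)`.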
summary(combined)
townname sid male race
Length:1200 Min. : 1.0 Min. :0.0000 Length:1200
Class :character 1st Qu.: 528.8 1st Qu.:0.0000 Class :character
Mode :character Median :1041.5 Median :0.0000 Mode :character
Mean :1037.5 Mean :0.4917
3rd Qu.:1554.2 3rd Qu.:1.0000
Max. :2053.0 Max. :1.0000
hispanic agepft height weight
Min. :0.0000 Min. : 8.961 Min. :114 Min. : 42.00
1st Qu.:0.0000 1st Qu.: 9.610 1st Qu.:135 1st Qu.: 65.00
Median :0.0000 Median : 9.906 Median :139 Median : 74.00
Mean :0.4342 Mean : 9.924 Mean :139 Mean : 79.33
3rd Qu.:1.0000 3rd Qu.:10.177 3rd Qu.:143 3rd Qu.: 89.00
Max. :1.0000 Max. :12.731 Max. :165 Max. :207.00
NA's :89 NA's :89 NA's :89
bmi asthma active_asthma father_asthma
Min. :11.30 Min. :0.0000 Min. :0.00 Min. :0.00000
1st Qu.:15.78 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.00000
Median :17.48 Median :0.0000 Median :0.00 Median :0.00000
Mean :18.50 Mean :0.1463 Mean :0.19 Mean :0.08318
3rd Qu.:20.35 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.:0.00000
Max. :41.27 Max. :1.0000 Max. :1.00 Max. :1.00000
NA's :89 NA's :31 NA's :106
mother_asthma wheeze hayfever allergy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1023 Mean :0.3313 Mean :0.1747 Mean :0.2929
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :56 NA's :71 NA's :118 NA's :63
educ_parent smoke pets gasstove
Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :3.000 Median :0.0000 Median :1.0000 Median :1.0000
Mean :2.797 Mean :0.1638 Mean :0.7667 Mean :0.7815
3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :64 NA's :40 NA's :33
fev fvc mmef pm25_mass
Min. : 984.8 Min. : 895 Min. : 757.6 Min. : 5.960
1st Qu.:1809.0 1st Qu.:2041 1st Qu.:1994.0 1st Qu.: 7.615
Median :2022.7 Median :2293 Median :2401.5 Median :10.545
Mean :2031.3 Mean :2324 Mean :2398.8 Mean :14.362
3rd Qu.:2249.7 3rd Qu.:2573 3rd Qu.:2793.8 3rd Qu.:20.988
Max. :3323.7 Max. :3698 Max. :4935.9 Max. :29.970
NA's :95 NA's :97 NA's :106
pm25_so4 pm25_no3 pm25_nh4 pm25_oc
Min. :0.790 Min. : 0.730 Min. :0.4100 Min. : 1.450
1st Qu.:1.077 1st Qu.: 1.538 1st Qu.:0.7375 1st Qu.: 2.520
Median :1.815 Median : 2.525 Median :1.1350 Median : 4.035
Mean :1.876 Mean : 4.488 Mean :1.7642 Mean : 4.551
3rd Qu.:2.605 3rd Qu.: 7.338 3rd Qu.:2.7725 3rd Qu.: 5.350
Max. :3.230 Max. :12.200 Max. :4.2500 Max. :11.830
pm25_ec pm25_om pm10_oc pm10_ec
Min. :0.1300 Min. : 1.740 Min. : 1.860 Min. :0.1400
1st Qu.:0.4000 1st Qu.: 3.020 1st Qu.: 3.228 1st Qu.:0.4100
Median :0.5850 Median : 4.840 Median : 5.170 Median :0.5950
Mean :0.7358 Mean : 5.460 Mean : 5.832 Mean :0.7525
3rd Qu.:1.1750 3rd Qu.: 6.418 3rd Qu.: 6.855 3rd Qu.:1.1975
Max. :1.3600 Max. :14.200 Max. :15.160 Max. :1.3900
pm10_tc formic acetic hcl
Min. : 1.990 Min. :0.340 Min. :0.750 Min. :0.2200
1st Qu.: 3.705 1st Qu.:0.720 1st Qu.:2.297 1st Qu.:0.3250
Median : 6.505 Median :1.105 Median :2.910 Median :0.4350
Mean : 6.784 Mean :1.332 Mean :3.010 Mean :0.4208
3rd Qu.: 8.430 3rd Qu.:1.765 3rd Qu.:4.000 3rd Qu.:0.4625
Max. :16.440 Max. :2.770 Max. :5.140 Max. :0.7300
hno3 o3_max o3106 o3_24
Min. :0.430 Min. :38.27 Min. :28.22 Min. :18.22
1st Qu.:1.593 1st Qu.:49.93 1st Qu.:41.90 1st Qu.:23.31
Median :2.455 Median :64.05 Median :46.74 Median :27.59
Mean :2.367 Mean :60.16 Mean :47.76 Mean :30.23
3rd Qu.:3.355 3rd Qu.:67.69 3rd Qu.:55.24 3rd Qu.:32.39
Max. :4.070 Max. :84.44 Max. :67.01 Max. :57.76
no2 pm10 no_24hr pm2_5_fr
Min. : 4.60 Min. :18.40 Min. : 2.05 Min. : 9.01
1st Qu.:12.12 1st Qu.:20.71 1st Qu.: 4.74 1st Qu.:10.28
Median :16.40 Median :29.64 Median :12.68 Median :22.23
Mean :18.99 Mean :32.64 Mean :16.21 Mean :19.79
3rd Qu.:23.24 3rd Qu.:39.16 3rd Qu.:26.90 3rd Qu.:27.73
Max. :37.97 Max. :70.39 Max. :42.95 Max. :31.55
NA's :100 NA's :300
iacid oacid total_acids lon
Min. :0.760 Min. :1.090 Min. : 1.520 Min. :-120.7
1st Qu.:1.835 1st Qu.:2.978 1st Qu.: 4.930 1st Qu.:-118.8
Median :2.825 Median :4.135 Median : 6.370 Median :-117.7
Mean :2.788 Mean :4.342 Mean : 6.708 Mean :-118.3
3rd Qu.:3.817 3rd Qu.:5.982 3rd Qu.: 9.395 3rd Qu.:-117.4
Max. :4.620 Max. :7.400 Max. :11.430 Max. :-116.8
lat
Min. :32.84
1st Qu.:33.93
Median :34.10
Mean :34.20
3rd Qu.:34.65
Max. :35.49
Impute Data
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# A tibble: 4 × 3
smoke_gas_exposure average_fev sd_fev
<chr> <dbl> <dbl>
1 Both 2026. 300.
2 Gas Stove Only 2024. 319.
3 Neither 2059. 328.
4 Second Hand Smoke Only 2057. 293.
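The code that produced this table is not shown in the rendered output. As a rough sketch only, the steps could look like the following in base R. The grouping used for imputation (`male` by `hispanic`), the helper name `impute_group_mean`, and the toy values are all assumptions; only the four category labels come from the table above.

```r
# Toy data (hypothetical values) with the columns used in this step.
toy <- data.frame(
  male     = c(1, 1, 1, 0, 0, 0),
  hispanic = c(0, 0, 0, 1, 1, 1),
  fev      = c(2000, NA, 2200, 1900, 2100, NA),
  smoke    = c(0, 1, 1, 0, 0, 1),
  gasstove = c(1, 1, 0, 0, 1, 1)
)

# Hypothetical helper: replace each missing value with the mean of the
# non-missing values in the same group (groups are passed through to ave()).
impute_group_mean <- function(x, ...) {
  ave(x, ..., FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v))
}
toy$fev <- impute_group_mean(toy$fev, toy$male, toy$hispanic)

# Combine the two binary exposure indicators into one four-level category.
toy$smoke_gas_exposure <- with(toy, ifelse(
  smoke == 1 & gasstove == 1, "Both",
  ifelse(smoke == 1, "Second Hand Smoke Only",
         ifelse(gasstove == 1, "Gas Stove Only", "Neither"))
))

# Mean FEV per exposure category, analogous to the average_fev column.
average_fev <- tapply(toy$fev, toy$smoke_gas_exposure, mean)
```

Group-mean imputation suits continuous variables such as `fev` or `bmi`. For binary indicators such as `smoke` and `gasstove`, a within-group mode would be a more natural fill value.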
Exploratory Data Analysis
Association between BMI and Forced Expiratory Volume (FEV)
library(ggplot2)
ggplot(data = combined, mapping = aes(x = bmi, y = fev)) +
  geom_point() +
  geom_smooth(method = "loess", col = "pink", se = FALSE) +
  labs(title = "Scatterplot of BMI vs Forced Expiratory Volume (mL/sec)", x = "BMI", y = "FEV (mL)")
`geom_smooth()` using formula = 'y ~ x'
Based on this preliminary visualization, there seems to be a positive association between BMI and FEV. The positive trend holds up to a BMI of about 30, beyond which the smoothed curve turns slightly negative.
Association between Smoke and Gas Exposure and Forced Expiratory Volume (FEV)
labels_data <- combined %>%
  group_by(smoke_gas_exposure) %>%
  summarise(mean_fev = mean(fev, na.rm = TRUE))
combined |>
  ggplot(mapping = aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Forced Expiratory Volume (mL/sec) by Smoke and Gas Exposure",
       x = "Smoke and Gas Exposure", y = "Forced Expiratory Volume (mL/sec)") +
  geom_text(data = labels_data,
            aes(x = smoke_gas_exposure, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3) +
  theme_minimal()
Based on this preliminary visualization, there do not seem to be significant differences in FEV across smoke and gas exposure categories, although further analysis is required.
Association between PM2.5 Exposure and Forced Expiratory Volume (FEV)
ggplot(data = combined, mapping = aes(x = pm25_exposure, y = fev)) +
  geom_point() +
  geom_smooth(method = "loess", col = "pink", se = FALSE) +
  labs(title = "Scatterplot of PM2.5 Exposure vs Forced Expiratory Volume (mL/sec)", x = "PM2.5 Exposure", y = "FEV (mL/sec)")
`geom_smooth()` using formula = 'y ~ x'
Based on this preliminary visualization, there seems to be a slightly negative (although weak) relationship between PM2.5 exposure and forced expiratory volume (FEV).
Data Visualization
Scatterplots of BMI vs FEV by Town
combined[!is.na(combined$townname), ] |>
  ggplot() +
  geom_point(mapping = aes(x = bmi, y = fev)) +
  facet_wrap(~ townname, scales = "free") +
  geom_smooth(mapping = aes(x = bmi, y = fev), method = "loess", col = "pink", se = FALSE) +
  labs(title = "Scatterplots of BMI vs Forced Expiratory Volume (FEV) by Town", x = "BMI", y = "FEV (mL/sec)")
`geom_smooth()` using formula = 'y ~ x'
Although the associations between BMI and FEV vary between towns, most seem to be positive and fairly strong (although further analysis is required to confirm this). Further, some towns (like Atascadero) show a more linear relationship between the two variables than others (like Alpine, Lake Gregory, and Mira Loma). It’s also important to note that in some towns (like Lompoc, Mira Loma, and San Dimas) the loess curves turn downward at higher BMI values, although further analysis is required to investigate the possible cause of this.
Stacked histograms of FEV by BMI category
ggplot(combined, aes(x = fev, fill = factor(obesity_level))) +
  geom_histogram(position = "stack", bins = 25) +
  labs(title = "Stacked Histogram of FEV by BMI Category", x = "FEV (mL/sec)", y = "Count") +
  scale_fill_brewer(palette = "RdPu") +
  theme_minimal()
Based on this stacked histogram, we can immediately see that the majority of observations fall under the “Normal” BMI level, with far fewer in “Obese”, “Overweight”, and “Underweight” (in descending order). Also, most observations in the “Normal” category are concentrated around an FEV value of 2000, with counts tapering off in either direction from this peak (a unimodal, approximately normal distribution). Although the distributions for the remaining three categories are also unimodal (with the exception of “Overweight”), their respective peaks are shifted (“Obese” = 2250, “Underweight” = 1650).
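The construction of `obesity_level` is not shown in the rendered output. One way such a variable could be derived from `bmi` is with `cut()`, sketched below. The cutoffs (14, 22, 24) and the toy BMI values are assumptions for illustration and may not match the ones used in this analysis.

```r
# Toy BMI values (hypothetical), including one missing value.
bmi <- c(12.5, 15.8, 17.5, 22.9, 26.0, NA)

# Assumed cutoffs: < 14 Underweight, 14 to < 22 Normal,
# 22 to < 24 Overweight, >= 24 Obese. right = FALSE makes each
# interval closed on the left and open on the right.
obesity_level <- cut(
  bmi,
  breaks = c(-Inf, 14, 22, 24, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE
)

# Counts per category, keeping track of unclassified (NA) rows.
level_counts <- table(obesity_level, useNA = "ifany")
```

Because `cut()` returns a factor whose levels follow the order of `labels`, plot legends would run from “Underweight” to “Obese” rather than alphabetically.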
Stacked histograms of FEV by Smoke and Gas Exposure Category
ggplot(combined, aes(x = fev, fill = factor(smoke_gas_exposure))) +
  geom_histogram(position = "stack", bins = 25) +
  labs(title = "Stacked Histogram of FEV by Smoke and Gas Exposure Category", x = "FEV (mL/sec)", y = "Count") +
  scale_fill_brewer(palette = "RdPu") +
  theme_minimal()
Unlike in the previous stacked histogram, where a single category dominated, the majority of observations here come from two categories: “Both” and “Gas Stove Only”. Also, the distributions for all four categories seem to be unimodal and approximately normal, with respective peaks concentrated around an FEV value of 2000.
Bar Chart of BMI by Smoke and Gas Exposure
ggplot(data = combined, aes(x = cut(bmi, breaks = 15), fill = factor(smoke_gas_exposure))) +
  geom_bar(position = "stack") +
  scale_fill_brewer(palette = "RdPu") +
  labs(title = "Stacked Bar Chart of BMI by Smoke and Gas Exposure", x = "BMI", y = "Count") +
  theme_minimal()
This stacked bar chart shows that the majority of BMI data falls under the “Both” and “Gas Stove Only” categories. Also, while the distribution is unimodal (centered at BMI values around 15-19), it is right-skewed: most observations sit at smaller BMI values, with a long tail toward larger ones.
Boxplot (Statistical Summary Graph) of FEV by Obesity Level
labels_data2 <- combined %>%
  group_by(obesity_level) %>%
  summarise(mean_fev = mean(fev, na.rm = TRUE))
ggplot(data = combined, aes(x = obesity_level, y = fev, fill = obesity_level)) +
  geom_boxplot() +
  labs(title = "Boxplot of Forced Expiratory Volume by Obesity Level", x = "Obesity Level", y = "FEV (mL/sec)") +
  scale_fill_brewer(palette = "RdPu") +
  geom_text(data = labels_data2,
            aes(x = obesity_level, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3) +
  theme_minimal()
Comparing the boxplots of FEV across obesity levels, we can see distinct differences in the median values for each BMI category. The medians for “Obese” and “Overweight” are relatively close and are the two highest across categories, while the median for the “Normal” group is slightly lower. The median FEV value for the “Underweight” group is the lowest of the four groups (around 300 units less than the “Normal” median, and around 450 units less than for the remaining two categories).
Boxplot (Statistical Summary Graph) of FEV by Smoke and Gas Exposure Category
labels_data <- combined %>%
  group_by(smoke_gas_exposure) %>%
  summarise(mean_fev = mean(fev, na.rm = TRUE))
ggplot(data = combined, aes(x = smoke_gas_exposure, y = fev, fill = smoke_gas_exposure)) +
  geom_boxplot() +
  labs(title = "Boxplot of Forced Expiratory Volume by Smoke and Gas Exposure Category",
       x = "Smoke and Gas Exposure Category", y = "FEV (mL/sec)") +
  scale_fill_brewer(palette = "RdPu") +
  geom_text(data = labels_data,
            aes(x = smoke_gas_exposure, y = mean_fev, label = round(mean_fev, 1)),
            vjust = -0.75, color = "black", size = 3) +
  theme_minimal()
As discussed previously, the median FEV values across the four smoke and gas exposure categories are relatively similar. The values printed on the plot are group means rather than medians: the “Neither” group has the highest mean FEV at 2059.1 mL/sec, and the “Gas Stove Only” group has the lowest at 2023.5 mL/sec. However, further analysis is required to determine whether these differences are statistically significant.
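One possible starting point for that further analysis is a Kruskal-Wallis rank-sum test, which compares the FEV distributions across the four groups without assuming normality. The sketch below runs on simulated data (the group sizes, mean, and standard deviation are made up), so its output says nothing about the CHS results; it only illustrates the calls.

```r
set.seed(42)

# Simulated data (hypothetical): four exposure groups drawn from the same
# distribution, so no true group difference exists here.
groups <- c("Both", "Gas Stove Only", "Neither", "Second Hand Smoke Only")
toy <- data.frame(
  smoke_gas_exposure = rep(groups, each = 50),
  fev                = rnorm(200, mean = 2030, sd = 300)
)

# Report medians and means separately, since a boxplot's center line is the
# median while the printed labels in the plot above are means.
group_medians <- tapply(toy$fev, toy$smoke_gas_exposure, median)
group_means   <- tapply(toy$fev, toy$smoke_gas_exposure, mean)

# Kruskal-Wallis rank-sum test across the four groups.
kw <- kruskal.test(fev ~ smoke_gas_exposure, data = toy)
kw$p.value
```

A one-way ANOVA via `aov(fev ~ smoke_gas_exposure, data = toy)` would be the parametric counterpart if the normality assumption looks reasonable.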
Map showing the concentrations of PM2.5 mass in each of the CHS communities
This leaflet map of PM2.5 mass concentrations across the 12 communities in this study points to greater concentrations in communities closer to Los Angeles. This makes sense, as urban sources of air pollution likely play a large role in PM2.5 mass. The smallest concentrations appear in the northernmost communities, which are also located along or closer to California’s coast and thus may benefit geographically in overall air quality.
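The map code itself is not included in the rendered output. A minimal sketch of how such a map could be built with the `leaflet` package is shown below. The town names, coordinates, and PM2.5 values are hypothetical placeholders, and the tile provider, palette, and marker size are arbitrary choices rather than the ones used for the map described here.

```r
library(leaflet)

# One row per community (hypothetical names, coordinates, and values).
towns <- data.frame(
  townname  = c("Town A", "Town B", "Town C"),
  lat       = c(32.84, 34.10, 35.49),
  lon       = c(-116.8, -117.7, -120.7),
  pm25_mass = c(8.7, 29.97, 5.96)
)

# Continuous color scale mapped over the observed PM2.5 range.
pm25_pal <- colorNumeric(palette = "RdPu", domain = towns$pm25_mass)

pm25_map <- leaflet(towns) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircleMarkers(
    lng = ~lon, lat = ~lat,
    color = ~pm25_pal(pm25_mass),   # marker color encodes PM2.5 mass
    label = ~townname,
    radius = 8, fillOpacity = 1
  ) |>
  addLegend(
    position = "bottomleft", pal = pm25_pal, values = ~pm25_mass,
    title = "PM2.5 mass", opacity = 1
  )
```

With the merged data, the per-town table could be obtained first with `unique(combined[, c("townname", "lat", "lon", "pm25_mass")])`, so that each community is drawn once.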
Association between PM2.5 Mass and FEV
summary(combined$pm25_mass)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.960 7.615 10.545 14.362 20.988 29.970
ggplot(data = combined, mapping = aes(x = pm25_mass, y = fev)) +
  geom_point() +
  geom_smooth(method = "loess", col = "pink", se = FALSE) +
  labs(title = "Scatterplot of PM2.5 Mass vs Forced Expiratory Volume (mL/sec)", x = "PM2.5 Mass", y = "FEV (mL/sec)") +
  xlim(5.96, 29.97)
`geom_smooth()` using formula = 'y ~ x'
Based on this scatter plot, there seems to be a negative (although weak) association between PM2.5 mass and forced expiratory volume (FEV).
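To put a number on a visual impression like this, one option is a simple linear fit of FEV on PM2.5 mass. The sketch below uses simulated data with a built-in negative slope. The town-level PM2.5 values, sample sizes, and noise level are assumptions, so the estimates are illustrative only.

```r
set.seed(1)

# Simulated data (hypothetical): six towns, 30 children each. PM2.5 mass is
# measured per town, so it takes only one value per community.
town_pm25 <- c(5.96, 7.6, 10.5, 14.4, 21.0, 29.97)
toy <- data.frame(pm25_mass = rep(town_pm25, each = 30))
toy$fev <- 2100 - 3 * toy$pm25_mass + rnorm(nrow(toy), sd = 300)

# Individual-level linear fit: slope is the change in FEV per unit PM2.5 mass.
fit      <- lm(fev ~ pm25_mass, data = toy)
slope    <- coef(fit)[["pm25_mass"]]
slope_ci <- confint(fit)["pm25_mass", ]

# Town-level view: one mean FEV per community, then a simple correlation.
town_means <- aggregate(fev ~ pm25_mass, data = toy, FUN = mean)
town_cor   <- cor(town_means$pm25_mass, town_means$fev)
```

Since every child in a town shares the same exposure value, the individual-level fit treats correlated observations as independent. The town-level summary, or a mixed model with a town random effect, would be a more cautious basis for inference.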